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Abstract: 

A popular approach to nonparametric option pricing is the Minimum Cross Entropy 
(MCE) method based on minimization of the relative Kullback-Leibler entropy of the price 
density distribution and a given reference density, with observable option prices serving 
as constraints. When market prices are noisy, the MCE method tends to overfit the data 
and often becomes unstable. We propose a non-parametric option pricing method whose 
input are noisy market prices of arbitrary number of European options with arbitrary 
maturities. Implied transition densities are calculated using the Bayesian inverse theory 
with entropic priors, with a reference density which may be estimated by the algorithm 
itself. In the limit of zero noise, our approach is shown to reduce to the canonical MCE 
method generalized to a multi-period case. The method can be used for a non-parametric 
pricing of American/Bermudan options with a possible weak path dependence. 
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1 Introduction 



The linear inverse theory deals with the problem of reconstruction of a signal {/j} out of 
data (images) D = {di, . . . , dm} which are obtained from a signal by a linear operator K, 
and may be corrupted with a noise e: 

n 

di^J2^ijfj+^i « = l,---,"7, (1) 
J=l 

Since the number of available data m is typically much smaller than the dimension n of 
the probability space, inverse problems are highly undcrdetermincd and ill-posed. Addi- 
tional considerations, such as a prior knowledge and/or guiding regularization principles, 
become of a critical importance for solving such a problem. Inverse problems are widely 
encountered in fields as diverse as image processing, geophysics, speech recognition, com- 
putational linguistics, molecular biology, econometrics etc. Much progress in developing 
reliable and robust methods of inversion was made in these areas over last decades, with 
mainstream efforts concentrated on Bayesian and Maximum Entropy (MaxEnt) or Mini- 
mum Cross Entropy (MCE) methods, see e.g. [1, 2, 3, 4]. 

A popular paradigm in option pricing theory is the nonparametric option pricing 
which is based on inferring a (risk-neutral) price probability density out of market prices 
of liquid assets, and then calculating prices of other, typically less liquid instruments as 
convolutions of payoff functions with the inferred distribution. Depending on the data, a 
solution may not exist (that would signal arbitrage opportunities) or, more often, there 
may be multiple probability measures consistent with the same data (this means that 
the market is incomplete), thus the question of choosing the "right" solution arises. The 
problem of obtaining asset price distributions out of option prices is thus a typical (linear) 
inverse problem, which can be addressed using well established methods of the inversion 
theory. 

One popular method to find implied densities is to use the MCE inversion method 
as advocated and developed by Jaynes [1], without invoking Bayesian arguments. This 
approach was initiated in the financial literature by Gulko [5], Buchen and Kelly [6], and 
Stutzer [7]. Within this method, the asset price distribution p{x) is found by requiring 
that it should minimize the relative entropy of p{x) and a given reference distribution 
q{x), while reproducing the market prices of options in a chosen calibrating set (with all 
options having the same maturity). The relative entropy of two distributions p{x) and 
q{x) is given by the so-called KuUback-Leibler (KL) distance [8] 

D\p{x)\\q{x)] = J dx p{x) log-^^ (2) 

As log(l/a;) is a convex function, Jensen's inequality 

J dyp{y)g[q{x)] > 9 (^J dyp{y)q{y)j (3) 

valid for any convex function g and probability p[y) such that / dyp{y) — 1, shows that 
L>[p(a;)||g(a;)] > for any distributions p{x),q{x). 
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The reference density q{x) in (2) accumulates our prior knowledge and/or expectation 
for the answer. There exist several possibilities to choose such a density. One of them 
is to take q{x) to be the historical asset price distribution, as suggested by Stutzer [7]. 
One the other hand, if no historical information is available or if one believes that the 
risk-neutral distribution should be very different from the historical one, one may take 
a lognormal distribution or even a uniform reference distribution q{x) — const. In the 
latter case, the KL distance becomes the (minus) absolute entropy of distribution p{x). 
The principle of maximizing the absolute entropy of a distribution (MaxEnt) is thus but a 
particular case of a more general principle of minimizing the relative KL entropy (MCE). 

With a uniform reference density q{x) — const, the MaxEnt method gives the least 
informative and least prejudiced asset price distribution compatible with the market data. 
As was stressed by Gulko (who considered the case q{x) = const only), the MaxEnt 
principle thus formalizes and quantifies the Efficient Market Hypothesis in the form of 
a static characterization of market beliefs about future prices: the distribution must be 
such that it maximizes uncertainty about the future price movements. By the meaning 
of entropy [8], this means that the distribution should be of a highest possible absolute 
entropy, i.e. the lowest possible KL distance to a uniform distribution. The argument thus 
obviously generalizes to the case of a non-uniform density q{x) expressing the market belief 
about the future: the most probable distribution p{x) should minimize the KL distance 
to the reference distribution q{x), while being consistent with the market data. The MCE 
approach thus provides a consistent and theoretically sound method of "completing the 
market", i.e. choosing the "best", maximally unbiased risk-neutral probability measure 
if several possibilities compatible with the data are possible. 

In the early works [5, 6, 7], the MCE method was applied to a simplest one-period 
model, with all options in the calibrating set having the same maturity. Furthermore, 
no noise was supposed to be present in the data. Later, the method was extended to a 
multi-period setting by Avellaneda and co-workers [9] using methods of stochastic optimal 
control theory, again without allowing for a noise in the data. For other recent applications 
of the canonical MCE approach, see Derman and Zou [10] and Stutzer [11]. 

In reality, noise is always present in option prices, and can have a number of sources 
including non-synchronicity of trading, straight-out reporting errors, different trader per- 
spectives of the market, etc. As was found by Buchen and Kelly [6] with Monte Carlo 
simulation, the MCE method often becomes unstable even for a modestly low level of 
noise in the data, leading sometimes to a fake multi-nodal form of an implied distribu- 
tion. This behavior is traced back to the construction of the canonical MCE method (see 
below in Sect. 3) which enforces exact matching of the market prices and thus tends to 
overfit the data. These problems have led Buchen and Kelly to the conclusion that the 
method should be abandoned altogether [13] in favor of the so-called geophysical inverse 
theory based on a completely different approach. However, a way to overcome these diffi- 
culties within a natural generalization of the canonical entropy approch has been known 
for a long time to information theory researchers: It was stressed by Jaynes as early as 
in 1982 [12] that when the data are noisy, the MaxEnt and MCE methods are unsuitable 
and should be substituted by the Bayesian inverse theory in combination with the 
MCE principle. There exists an extensive literature [12, 3] documenting both the need 
and the way to modify the canonical MCE method for dealing with noisy data. 
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The contribution of the present work is two-fold. First, we develop a Bayesian Inverse 
Theory approach to implied option pricing with noisy data, that naturally generalizes 
the canonical MCE method and reduces to the latter in the limit of zero noise. The 
only restriction we impose on the stochastic process for the asset price is the Markovian 
property. Beyond this, its nature (diffusion, jump- diffusion, etc.) is left unspecified. 
Second, we extend the method to the case of calibration to options of arbitrary maturities 
rather than one fixed maturity as in [5, 6, 7]. In contrast to Avellaneda et.al [9], we 
build a recursive procedure. We first calculate conditional transition probabilities with 
transition times being the adjacent maturities of options in the calibrating set. This is 
performed recursively starting from a shortest maturity. When this is done, we may, 
if needed, fill the model "in between" introducing intermediate time levels between the 
shortest and the longest maturities of the calibrating set. This may be necessary e.g. for 
pricing of a Bermudan option with exercise dates some (or all) of which do not belong 
in the given set of maturity dates. Our method allows one to introduce an arbitrary 
number of intermediate time levels and calculate corresponding conditional transition 
densities. The method involves two adjustable parameters related to modeling the noise 
and treating the reference density (see below) that may be either estimated from the 
data, or serve as a measure of the model risk. As the output of our method is a set of 
conditional transition densities, it may be considered as a generalization of the implied 
tree methodology of Rubinstein [14] and Derman and Kani [15] (see [16] for a review). 
In this class of models, closest to ours is presumably a semirecombining implied tree 
construction of Brown and Toft [17]. Similarly to the method of [17], we first find the 
conditional transition densities between different maturity dates in the calibrating set, 
and then calculate distributions for intermediate times of interest such as exercise dates 
for a Bermudan option. However, in contrast to [17], we do not discretize a stochastic 
process for the underlying, do not introduce further time levels and, most important, 
use a different approach to fix the transition densities, which also allows one to account 
for the noise in data. Computationally, our method is equivalent to performing a single 
MCE optimization for each maturity where data are available, and calculating four one- 
dimensional integrals for each additional time level. In addition, our algorithm is flexible 
in handling the reference density: at each step, it is generally given by a linear combination 
of a prior (e.g. historical) density and a conditional density of a previous step evolved to 
the current time (see Eq.(7) below). The attractive feature of allowing for a non- vanishing 
value of < p < 1 is that it smoothes the transition density in the time direction. 

Our presentation is organized as follows. In Sect. 2 we introduce the basic recursive 
algorithm. The subsequent sections 3 to 7 describe a single time step of the algorithm. In 
Sect. 3 we outline the MCE approach to the problem valid when no noise is present in the 
data. The Bayesian entropic inverse theory approach to the problem needed when data 
are noisy is presented in Sect. 4. In Sect. 5 we explain how the noise variance should be 
integrated out in our Bayesian setting. A resulting non-linear optimization problem with 
an objective function (which we call the effective Lagrangian using physics' nomenclature) 
is formulated in Sect. 6, and its convexity properties are studied. The actual minimization 
of the effective Lagrangian is performed in Sect. 7 using a linearization trick that leads to 
a drastic reduction in the computational cost. Sect. 8 explains how to calculate transitions 
densities involving additional time levels. The final Sect. 8 contains our conclusion. 
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2 Recursive Algorithm 



We assume that we are given an ordered set of maturities {Ti, . . . , T„}. For each maturity 
Ti,i — 1, . . . ,n, there are iVj available European option prices. The option payoff functions 
are supposed to be linearly independent [6]. 

Our scheme can be introduced as follows. Let Xi, . . . , be the asset prices at times 
{Ti, . . . ,T„}, respectively. We assume that a stochastic process governing the evolution 
of the asset price is Markovian. Time today is t = 0. Consider now the time moment 
Tj_i for some i. We ask what the asset price distribution should be at time Tj. Had we 
known the asset price — x at time Ti_i, following the MCE approach of [5, 6, 7] we 
would minimize the KL entropy for the distribution p(Xj|Xj_i = .Xj^i). Since at most we 
might know only the distribution of the asset price at a future moment Tj_i, the 

best we can do is to require that KL distance should be minimized on average. In other 
words, we require minimization of the conditional KL entropy [8] 

dxp{xi_i) / dyp{xi\xi_i) log '~ , (4) 

J \q{Xi\Xi-i) J 

with respect to p{xi\xi-i) for each i — 1, . . . ,n, subject to the condition of matching the 
observed market option prices. This suggests a forward recursive algorithm for solving the 
problem. (The matching of data can be done either in the form of absolute constraints 
or in the Bayesian framework via a likelihood function. As will be clear below, the latter 
preserves the forward structure of the relations between different distributions.) 

It should be noted that this recursive algorithm is equivalent to a direct minimization 
of the KL entropy of the joint distribution p{Xi, . . . , X^) 

D\p{xi,...,Xn)\\q{xi,...,Xn)] = / dxi . . . dXnP{xi, ...,Xn) log ^^ ' " " " ' ^"j (5) 

J q[Xi,...,Xn) 

Indeed, using the chain rule for relative entropy [8], we obtain 

n 

D\p{xi, . . .,Xn)\\q{xi, . . .,Xn)] = ^Ipi^^l^i-'^^ ' ' ■ , Xl)\\q{Xi\Xi-i, . . .,Xi)] 

1=1 

n 

1=1 

where on the last step we used the Markov property . . . ,Xi) = 

(and analogously for the reference distribution). As each term in the series (6) is non- 
negative, minimization of the whole expression amounts to a sequential minimization of 
each successive term in the series^. This is a special property of a Markov process which 
enables one to use a forward calculation to calculate probabilities^. The two schemes 
produce identical results, as can also be checked in a simple two-period setting. We note 
that it is the scheme of minimization of the joint KL distance (5) that was adopted by 

^This statement is valid assuming that a minimum is unique. As will be argued below in Sect. 3, in 
our optimization problem the minimum, if exists, is unique. 

similar forward procedure is used for implied trees in the approaches of [14, 15, 17]. 
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Avellaneda et.al [9]. We prefer the former forward recursive scheme because it is easier 
to implement, and also because the conditional transition probabilities needed for pricing 
other securities are its direct outputs, while in the scheme adopted in [9] their calculation 
includes additional integrations. Last but not least, a generalization to the case of noisy 
data is quite straightforward in the recursive scheme but not in the method of [9]. 

Our recursive algorithm proceeds as follows. Suppose that the problem is solved up to 
time Tfc, i.e. we know all conditional transition probabilities p{y,Ti^i\x,Ti) for i < k as 
well as the marginal density (conditional on the initial price) p{x) = p{x,Tk\xQ,0) where 
time today is t = and todays asset's value is Xq. All reference conditional transition 
densities are calculated as well. The recursion step is then performed as follows. 

1. Set the transition times t = Tf. ,T = Tfe+i. 

2. Calculate the conditional transition probability p{y\x) = p[y,T\x,t) of the asset price 
distribution at time T conditional on its value at time t. The calculation is performed 
using the Bayesian inverse theory, and requires the knowledge of the marginal density 
p{x) at time t. It also involves the reference conditional transition density q{ii\x) for this 
step. 

3. Calculate the reference conditional transition density for the next step. To this end, we 
shift all time arguments (which enter the calculation through options payoff functions) in 
p{y,Tk+i\x,Tk) to the next time level (Tfc,Tfc_,_i) (T^+i, T^+s). CaU this density qs{y\x). 
A general form for the reference density for the next step thus would be 



where qh{y\x) stands for the conditional density from time T^+i to time Tfc_,_2 calculated 
with a prior model (or a historical price distribution), and < p < 1 is a free parameter 
of the model. The limit p ^ 1 corresponds to maximum smoothing of a conditional 
transition density in the time direction^ 
4. Calculate a new marginal density 



to be used to find the conditional transition density on the next step. 

5. Increment k ^ k + 1. 

6. If k < n — 1, go to step 1. 

7. If needed, calculate conditional transition densities involving additional time levels. 
In what follows. Sects. 3 to 7 describe step 2 in the above scheme, while Sect. 8 explains 

the calculation on step 7. 

3 Canonical Minimum Cross Entropy Method 

Assuming that the unconditional density p{x) at time t = is known, we want to find 
the transition probability density p{y\x) for transition to the interval [y,y + dy] at time 

^The choice oiqs{y\x) as a prior distribution is reasonable if the real process is stationary, which means 
that the conditional density p{y,T\x,t) depends only on the difference T — t. We can further neglect 
an implicit dependence of Lagrange multipliers onT — t, which is a good zero approximation so long as 
Tk+2 - Tfe+i is close (or equal) to T^+i - T^. 



q{y\x) = (1 - p) qh{y\x) + pqs{y\x) 



(7) 




5 



T = Tfe+i conditional on being in state x at time 7]. We require that p{y\x) should be 
close to a prior density q{y\x), i.e. should minimize the conditional KL entropy functional 

/f T){v\x] 
dxp{x) J dyp{y\x) log -^^^^ (8) 

subject to all constraints imposed on p{y\x). The first constraint is the normalization 
equation 

dyp{y\x) = 1 (9) 



that should be imposed for all values of x. Further constraints are imposed by matching 
observed option prices. We consider a set of A'" = A^^+i European options maturing at 
time T — Tfe+i, with {fi{x,y)} being the payoff functions. For the sake of generality, we 
retain the x argument in fi{x, y), which may account for a certain path dependence. More 
common, however, is calibration to European vanilla call and put options whose payoff 
functions fi{x,y) = fi{y) are 

fi{y) = D{T-t)uiax{y-Ki,Q) {call) (10) 
U{y) = D{T-t)uvax{K,-y,Q) (put) (11) 

where {Ki} are strikes and D{T — t) are (deterministic) discount factors: D{T — t) ~ 
exp[— r(T — t)]. An exact matching of market prices imposes the constraints 

I dxp{x) I dyp(y\x)Mx, y) = Q , i^l,...,N (12) 

where Ci stands for a (noiseless) price of the ith option. 

The problem of minimization of (8) subject to constraints (9) and (12) is a standard 
problem of constrained optimization that is solved by the Largange multiplier method. To 
satisfy the normalization constraint (9) for each value of ,t, one needs a Lagrange multiplier 
function C,{x)p{x). Here the multiplicative factor p{x) (known from the previous step) is 
introduced for a later convenience. Constraints (12) are imposed using a discrete set of 
N Lagrange multipliers {Aj}. We thus arrive at the following Lagrangian function 

L = y dxp{x) J dyp{y\x)\og^^^j^ - J dxi{x)p{x) (^J dyp{y\x) - 

- ^^'{J d'^Pi^) J dyp{y\x)fi{x,y)-C^ (13) 

The Lagrangian (13) is minimized when its total variation vanishes, i.e. 

5L 5L dL 

0, -_- = 0, — = , t = l,...,N (14) 



Sp{y\x) 5^{x) dX 

The solution of the first two equations in (14) reads 



P{y\x) = 1^ exp \Mx, y)] (15) 
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where 

Z\(x) ^ J dy q(y\x) exp \ifi{x, y)^ (16) 

What remains is to ehminatc the Lagrange multiphers Aj by substituting the solution (15) 
in the last of Eqs.(14). This results in a set of nonlinear equations 



//p(a;)^^exp(^^A./,(a;,|/)j/,(a;,|/) = Q , z = l,...,iV 



(17) 



This problem can be solved e.g. by a multidimensional Newton- Raphson procedure [18]. 
As shown in [6] , the Jacobian matrix in this problem is given by the covariance matrix of 
the constraint functions 

Jij = cov [fi{x, y), fj{x, y)] (18) 

and thus is positive definite for linearly independent constraints, which is a necessary 
condition for the Newton-Raphson method to be applicable. In addition, Jij should be 
well-conditioned to ensure stability of the algorithm. As discussed in [6], the Jacobian 
may become ill-conditioned if market prices are noisy. In numerical solution 

for the Lagrange multipliers may become unstable. 

However, market prices are always noisy. As the classical MCE method enforces the 
exact matching of constraints (12), it tends to overfit the data. As a result, implied MCE 
probability distributions often become unstable and juggled. Intuitively, we might want 
to substitute the exact constraints (12) by approximate ones that would smooth out the 
noise in the data. It will be shown in the next sections that the Bayesian inverse theory 
with entropic priors provides just such a sort of modification of the MCE method. 



4 Noisy Data: Bayesian Entropic Inverse Theory 

In Bayesian probability theory [1], the posterior probability P{M\D, I) of a model (prob- 
ability density, in our case) M given the data D and information / is determined by the 
Bayes formula 

P{M\D,I) = P{M\I)^^^^^ (19) 

where P{M\I) the prior probability of the model assigned before the data is observed, 
and P{D\M, I) is the likelihood function measuring probability to observe the data D 
given the model M. Finally, the so-called evidence P{D\r) is the total probability of 
observing the data given the information I. As it does not depend on the model M whose 
probability we want to calculate, it merely serves as an irrelevant normalization factor for 
the posterior probability P{M\D, I). 

To solve the inverse problem of finding the transition probabilities based on data and a 
prior knowledge, we apply the Maximum Aposteriori Probability (MAP) approach [1, 3]. 
According to this method, the most probable model is one that maximizes the posterior 
probability P{M\D,I) in (19) i.e. the product of the prior P(M\I) and the likelihood 
function P{D\M. I). Note here the difference with the usual Maximum Likelihood method 
which requires fixing the prior in advance and optimizing the likelihood function alone. 
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On the contrary, the MAP method allows the prior to carry unknown parameters and/or 
functional dependence. 

We start with assigning the prior probability P{M\I). The following combinatorial 
argument [19] that extends one suggested by Jaynes [1], is of use here (see also [20] for a 
recent discussion of entropic priors). For simplicity, we will first consider marginal, not 
conditional probabilities, and discretize the state space X — {1,2, Let qj be the 

probability of the jth element of X. We can imagine an urn with known proportions qj 
of balls of type j, for j = 1,2, . . . , k. If we randomly draw n balls from the urn, then 
the probability of observing Ui = PiU balls of each class i is given by the multinomial 
distribution 

Pip\q) = P{n,, ...,n,\q) = -—— • • • C (20) 

ni\n2i ■ ■ ■ Uki 

which is the probability of seeing a p-distribution when sampling n times from a q- 
distribution. We assume that n 3> 1. Taking the logarithm of both sides of (20) and 
using the Stirling formula 

log m! = m logm — m + o(m) (21) 

we obtain 

P{p\q) oc exp I —n^pjlog— | (22) 

Note that for the case of a uniform reference distribution qj — q — const, this line of 
reasoning boils down to Jaynes' definition of the prior 

P{p) oc exp ^- pj log j (23) 

while the exact value of parameter n becomes irrelevant in this limit, as will be clear 
below. Extending the argument to the case of conditional probabilities, we write down 
the prior probability of distribution p{y\x) in our problem: 



P\p\\q] = exp 



-h J dxp{x) J dyp(y|x)log^|j|| (/ dyp{y\x) - l) (24) 



where we added the 5-function constraint for each value of x to ensure the correct nor- 
malization. Here h ^ 1 is a parameter controlling the relative weight of the entropy term 
in the posterior probability. 

Our next task is to compute the hkelihood function P{D\M,I). To this end, we 
assume that noise in prices is Gaussian^: 

I dxp(x) I dyp(y\x)fi{x, y) = Ci + ei , i^l,...,N (25) 



^Note that the assumption of an additive noise in the prices is, strictly speaking, theoretically prob- 
lematic, as potentially it allows for negative prices. However, assuming that a noise is not too large, a 
probability of this to occur will be negligible. Yet, presumably a more realistic approach would be to 
model an additive noise for logarithms of option prices. We will leave a study of such a model for a future 
work. 
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where 

EN = , ne^^a^ (26) 
The hkehhood function is therefore given by the following expression: 



(27) 



If the prior distribution for the noise variance is given by some function P{(j), the full 
prior in our problem becomes ^ 

P{M\I)^P{a)P\p\\q] (28) 

where is given by (24). As we are not interested in estimating the noise variances 

erf, they represent the so-called nuisance parameters (hypcrparameters) that should be 
integrated out according to the Bayesian approach [1, 2]. Therefore, for the posterior 
probability we obtain 

P\p\D]^P\p\\q] lldaiP{a)P[D\p,a] (29) 

Jo Y 

where P [p\\q] and P [D\p, a] are defined in (24) and (27), respectively. According to the 
MAP approach, this final expression should be maximized with respect to p{y\x) in order 
to find the most probable transition probability p{y\x) compatible with the data while 
remaining as close as possible to the reference density q{y\x). 

Eq.(29) allows one to readily see how the usual MCE method is reproduced in the limit 
of zero noise. This limit corresponds to knowledge of the prior for noise with certainty: 

P{a)^Us{a,-ai'^) (30) 

i 

and taking the hmit af^ — > after the integral in (29) is calculated. This yields 

P \p\D] oc P \p\q] U^{J J dxdyp{x)p{y\x)Mx, y) - (31) 
Using the integral representation of the 5-function 

\ roo ][ rico 

5{x) = — / dujexTp^iux) = — : / dXexp{—Xx) (32) 
2tt J—oo 27ri J-ioo 

we see that in the hmit of zero noise the MAP approach exactly reproduces the MCE 
method, with the saddle point values of parameters A in (32) for all ^-functions in (31) 
being exactly equivalent to Lagrange multipliers in (13). It is easy to sec that the exact 
value of parameter h becomes irrelevant in this limit, as it can be eliminated by a rescaling 
of Lagrange multipliers. 

fully Bayesian approach would also require a specification of a prior for parameter h in (24). We 
will not attempt doing this, but will rather view h as a fixed parameter, and will study stability of our 
results with respect to variations of h. 
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5 Integrating out Noise Variance 



In general, a noise in option prices does persist. To integrate it out according to the 
Bayesian theory, we should define a model for the prior distribution P{cr). Given the 
prior, we then calculate the integral 

roo 

n\p]^ / Ud^iP{a)P[D\p,a] (33) 

Jo Y 

One possibility is to assume the (5-function prior (30) but with {o-.^^-*} fixed and positive, 
assumed to be known numbers. In this case the integral (33) is trivially calculated with 
the result P D\p, a^^'' . This is in spirit of the frequentist approach to statistics where the 
noise variance is a fixed albeit unknown number which can be estimated from the data. 

Another possibility is to choose a smooth continuous prior for the noise variance. If 
no information about the noise variance is available, one choice would be to introduce a 
completely non- informative prior P{ct) = const. As was noticed by Jaynes (see [1]) this 
choice is unsatisfactory on the grounds that it does not respect the invariance principle: 
if we change the variable a — > a', the prior P{a) will generally change. By requiring that 
the prior P{a) should be invariant with respect to a change of scale 

a* ^aa , P{a*) = J~^P{a) (34) 

where in our case the Jacobian J — a, we obtain the solution 

P{a) = ^ (35) 
a 

This is called Jeffrey's prior. It is of frequent use in Bayesian approaches. Although 
the prior (35) is unnormalizable (so-called improper prior), its substitution in (33) leads 
to a converging integral. Another approach considered in the Bayesian literature is to 
choose prior P{x) distributions for which a posterior distribution P{x\D) can be easily 
calculated. The most popular choice arc the so-called conjugate priors [4]. They possess 
the intuitive feature that starting with a certain functional form for the prior, one ends 
up with a posterior of the same functional form, but with parameters updated by the 
sample information. For a Gaussian noise, the conjugate prior is therefore given by the 
Gamma distribution in 1/(t~^ : 

P{a) oc (a^)!-" exp (-/3a-') (36) 

Here a > and /3 > 0. An improper prior corresponds to the limit a < 3/2 and /3 — > 0. 
Note also that the particular choice a = 1^/2, P — 1/2 corresponds to the Chi-square 
distribution with u degrees of freedom ^. 



^This may hint on a bridge to the classical approach to statistics whose fundamental theorem states 
that if 5^ is the variance of a random sample of size n from a normal distribution with the mean ^ and 
variance cr^, then the random variable (n — l)S^/a^ has a Chi-square distribution with (n — 1) degrees 
of freedom. 
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In what follows we analyze both 5-function (30) and conjugate (36) choices for the 
prior. Jeffrey's prior is recovered by the particular choice 

a=l , P = (37) 

in the conjugate prior (36). (As will be seen in the next section,this choice may however 
be undesirable as it does not lead to a convex optimization problem.) We also note that 
a conjugate prior p{x) oc x°'~^e~^^ can be viewed as the MCE prior maximizing entropy 
given integral constraints E[X] = gi , £^[log(X)] = g2. Recall that the entropy itself can 
be interpreted as an expected value H{X) = E[— log p{X)], so that the second constraint 
here actually fixes the entropy of the distribution. 



6 Effective Lagrangian 

If we choose a conjugate prior (36), integral (33) is calculated using the well-known formula 
for the Gamma distribution 

/ dyv'-'e-'^^^ma-" (38) 
Jo 

and assuming identical distributions for all (7j. It is convenient here to define the "effective 
potential" 

= -logl][p] (39) 
For the conjugate prior (36) we thus obtain 



N 



^cp\p] = (a - l)^log 



J j dxdyp{x)p{y\x)fi{x,y) -C^ +2/3 



(40) 



plus an inessential constant. If instead the 5-function prior (30) is used, the effective 
potential is (hereafter we omit the subscript (0) from af*) 

^df\p\ =J2^[J J dxdyp{x)p{y\x)fi{x, y) - c}j (41) 

Using (41), (40), (24) and (29), we find that the MAP probability density p{y\x) is 

p{y\x) = arg min maxLe//[p, (42) 

p{y\x) e,(x) 

with the following "effective Lagrangian" : 

'p{y\x)\ 



Leff\p,i\ ^ hj dxp{x) j dyp{y\x)\og 



I dxi{x)p{x) (y dyp{y\x) - l) + (43) 



where is defined by either (40) or (41). 
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To find whether the resulting Lagrangian admits a unique minimum or multiple local 
minima, we study its convexity properties. If the effective Lagrangian (43) is convex, a 
solution of the minimization problem, if exists, is unique. Recall that a functional F[f{x)] 
is convex if 

F[A/i(x) + XMx)] < XFlMx)] + XF[f2{x)] (44) 

for any distributions fi{x), f2{x) and 0<A<1, A=l — A. 
Convexity of the first term in (43) is easy to establish: 



D 



\pi{y\x) + Xp2{y\x)\\q\ - XD\pi{y\x)\\q] - XD \p2{y\x)\\q] 

M(y|a^) + ^P2{,y\x)\ 



= A / / dxdyp{x)pi{y\x) \og . 
+X J J dxdyp{x)p2{y\x) log ^^""'^y^'^^'^'^^y^'' 



(45) 



P2{y\x) 



< 



because both terms are non-positive by Jensen's inequality (3). 

Consider now the effective potential term in (43). Convexity is easy to check for the 
(5-function prior (41) for which 



^df[Xpi + Xp2] -X^DF\pi] -X^df[P2\ 
XX 



JJdxpix)ipM^)-p2iy\x))Mx,y) 



< 



(46) 



Consider now the effective potential obtained with the conjugate prior (40). Since — log{y) 
is a convex function, convexity of the effective potential (40) is guaranteed if < a < 1 
and f3 = 0. For other values of parameters a, (3, convexity is lost. In particular, the choice 
of Jeffrey's prior (37) does not lead to a convex optimization problem. This docs not 
necessarily mean that it cannot give sensible results, but this case is bound to be more 
complicated than a convex problem, and will not be analyzed here. In what follows, the 
conjugated prior (36) will therefore be considered only in the parameter range < a < 1 
,/3 = 0. 



7 Minimization of Effective Lagrangian 

As it stands, the effective Lagrangian (43) appears to be considerably more complicated 
than the corresponding Lagrangian (13) of the canonical MCE method applicable when no 
noise is present. The difference is due to a non-linearity of the effective potential in (43), 
which is in contrast to a corresponding linear term in (13). The fact that this potential 
term is linear enables an analytical solution (15) of the variational problem in the classical 
MCE method, while a numerical scheme is needed only to fix the values of the Lagrange 
multipliers. On the contrary, in the present method there are no independent Lagrange 
multipliers, but the problem seems to boil down to a highly non-linear integral equation 
due to a non-linearity of the effective potential. 

With the MAP approach, a simple trick may be suggested that leads to a drastic 
simplification of the problem. The trick is akin to one used to treat non-linearities in 
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quantum field theory. We propose to linearize the effective potential in (43) by introducing 
new variables C/j, Aj and minimizing the following Lagrangian^ 



Lextb^i^U.X] = j dxp{x) J dyp{y\x) log^j^ - J dx^{x)p{x) (^J dyp{y\x) 



1 



piy\x) 
<l{y\x) 

+$[[/] +Y.^'AU'^ - I I dxdyp{x)p{y\x)Mx, y) + CA (47) 



where $[[/] stands for the effective potential (41) or (40) in which all combinations 
/ / dxdyp{x)p{y\x)fi{x,y) — Ci are substituted by Ui. 

^df[U] = E i^U^ ' ^CP[U] = E log [Ul] (48) 

It is easy to see that wc come back to the original effective Lagrangian (43) if we first 
minimize (47) over Ui and Aj With. p{y\x) ^ ^ fixed and substitute the result back into (47). 
The MAP solution p{y\x) is therefore determined by the following set of minimization 
equations: 

dp{y\x) 5^{x) dUi dXi 

First two equations in (49) yield 

Pxiy\x) = ^ exp (^E A./.(^, y)) (50) 

where 

Zx{x) = J dyq{y\x)ex.p (^Xifi{x,y)^ (51) 

Up to this point the solution is identical to one given by the canonical MCE method. The 
difference arises when it comes to a solution for the Lagrange multipliers Aj. Using (50) 
and the last two of equations (49), we end up with a set of non-linear equations for Xf 



J Jdxdyp{x)^^cxp(Y,X,Mx,y)jMx,y)^Q + U:{X,) , 



i^l,...,N (52) 



where U*{Xi) stands for a solution of the equation 



= -Xi (53) 



Equation (53) yields 



Ut{X,) ^-hafX, (DF) 

UtiK) =^ {CP) (54) 



''Here we rescale all terms by 1/h. 
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for the effective potentials (48), respectively. As one can see, for the choice ^{U) — 
^df{U) the solution smoothly goes to the canonical MCE solution in the limit — > 0. 

Let us now analyze convexity properties of the resulting problem of optimization in 
parameters {Aj}. By the back substitution of the MAP distribution (50) and solution 
(54) of (53) into (47), we arrive at the following function 

F(A) = min U,t\p,^,U,\\ = - / dxp(x) logZA(x) + *(A) (55) 

where 

^(A) = $(t/^) + ^A,(^/ + Q) (56) 

i 

and U*{Xi) is determined by Eq.(53). In particular, for the 5-function and conjugate 
priors we obtain 




2 V ' dXidXj 



^ ^ log + Y: + const , < (57) 

i i ^ J 



For the Hessian matrix we find 



dXidXj 



dxp{x) 



dyp\{y\x)fi{y)fj{y) - / dypx{y\x)fi{y) / dy'px{y'\x)fj{y') 



which is because a covariance matrix of constraints is positive definite for linearly indepen- 
dent constraints. We conclude that the function -F(A) (55) is concave for both choices of 
the effective potential, and therefore the solution of the linearized problem (47), if exists, 
is unique. Futhermore, we note that the derivative of the first term in (55) with respect 
to Aj is always non-positive for any value of Aj . Therefore, a nesessary condition that a 
maximum of function F[X) is attained at a finite value of Aj is that the derivative of the 
second term in (55) should be non-negative. This imposes a restriction on the range of 
possible values of Aji 

A, < Ci/al (DF) 

Xi <-^^ (CP) (59) 

We thus have reduced the problem of minimization of the effective Lagrangian (43) to 
a much simpler finite dimensional optimization problem (52)-(54) for the Lagrange mul- 
tipliers {Xi}, that may be solved by a multidimensional Newton- Raphson method. The 
computational cost of the method is therefore the same as that of the canonical MCE 
approach. 



14 



8 Adding Intermediate Time Levels 



The last stage of our construction is a calculation of conditional transition densities involv- 
ing additional time levels between the shortest and the longest maturities of European 
options in the calibrating set. In this section we will show how this problem can be 
consistently addressed using the MCE approach. 

In what follows, we will concentrate on the case when a single time level T is added 
between two adjacent maturity dates (T^ < T < Tk+i) in the calibrating set. The distri- 
butions p{x), p{z\x), p{z) of asset's prices X and Z at times and Tk+i, respectively, 
are supposed to be calculated from the data as described in the previous sections. We 
will also need prior densities q{y\x) , q{z\y) , q(z\x) which may be estimated from the p 
probabilities found previously by shifting time arguments as described above. We want to 
calculate the distribution p{y\x) = p{y,T\x,Ti;), p{z\y) = p{z,Tk^i\y,T). Having solved 
this problem, we could continue with the same method recursively and fill the model for 
all additional time levels of interest. 

One simple-minded approach to the problem would be to say that as soon as we know 
that p{y\x) 6{y — x) as T — j> and p{y\x) p{z\x) as T ^ T^+i, we could use linear 
interpolation to estimate transition densities for intermediate times: 

p{y\x) = {1- p)Ny{x,p)+pp{z\x)\^=y 

p(z\y) = (l-p)p(z\x)\^=y + pN,(y,l-p) (60) 

where p = {T — Tk)/{Tk+i — T^) and Nx{m,a'^) stands for the normal distribution in x 
with mean m and variance cr^. A problem with this approach is that such approximate 
transition densities would generally violate the Markov property 

P{x, y, z) = p{x)p{y\x)p{z\y) (61) 

and therefore would not satisfy the Chapman-Kolmogorov equation 

j dy p{y\x)p{z\y) = p{z\x) (62) 

and the consistency equation 

j dyp{y)p{z\y) = p{z) (63) 

Instead of trying to interpolate, we want to approach the problem using the Minimum 
Cross Entropy approach. This amounts to insisting that the asset price distribution should 
possess a minimum possible relative entropy with a reference density for arbitrary times, 
not only for maturity dates of options in the calibrating set. Within this framework, 
equation (62) can be imposed as a constraint on the MCE solution. The joint KL entropy 
that should be minimized is 

D\p{x,y,z)\\q{x,y,z)] = D[p{x)\\q{x)] + Dlp{y\x)\\q{y\x)] + D\p{z\y)\\q{z\y)] (64) 

where we used the chain rule and the Markov property. To find the MCE distributions 
p{y\x) and p{z\y), the entropy (64) should be minimized with the Chapman-Kolmogorov 
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equation (62) enforced as a constraint, and distributions p{x) and p{z\x) fixed. Writing 
the Lagrange multiplier as p{x)X{x,z) (which we may do as p{x) is fixed), we arrive at 
the Lagrangian 



D\p{x,y,z)\\q{x,y,z)]- J dxdzp{x)X{x, z) {^j dy p{y\x)p{z\y) - p{z\x) 

— [ dxdydzpix,y,z)\og ^\ — \— [ dxdz X(x, z) ( [ dyp(x,y, z) — p(x, z)](65) 
J q{x,y,z) J \J J 

Here we used the Markov property (61) to proceed to the second expression. Minimizing 
this Lagrangian with respect to p(x, y, z\ we obtain 



/ X (i{x,y,z) 
pyx, y, z) = p{x, z) 



(66) 



q{x,z) 

Note that it is crucial for our method that the density q is non-uniform (at least) in 
the y direction, as otherwise equation (66) would indicate that p(x, y, z) is uniform in y. 
This is reasonable because in such a case there would be no either prior information (via a 
reference density), or constraints on a distribution of Y . The present method thus adjusts 
the prior distributions q{y\x) and q{z\y) by enforcing the correct joint distribution of the 
end-point values p{x, z). 

Using the Markov property of the reference distribution, we find the resulting condi- 
tional densities 

^ I ^ ( \ \ [ ^ ^(^1^) ( \\ [ A ^ (liy\x)q{z\y) 1"^ 
p{y\x) = q{y\x) J dz p{x,z) J dydz — z7ZiZr\ P{x,z) (67) 



q(z\x) 

v\z\v) = q{z\y) / dx M x,z) 

J q[z\x) 



I 



dxdz 



q(z\x) 

q{y\x)q{z\y) 
q{z\x) 



p{x,z) 
p{x,z) 



(68) 



Equations (67), (68) solve the problem of finding the minimum cross entropy conditional 

densities p{y, T\x, T^) andp(2;, Tk^i\y, T; x, T^), given reference densities q{y\x), q{z\y), q{z\x) 
and fixed transition density p{x, z). If necessary, the same procedure can now be continued 
to get a more detailed partition of the interval [Tfc,Tfc+i] using the same approach. 



9 Numerical examples 

To illustrate the performance of our method, we generate an artificial "data" with a 
Monte-Carlo simulation. We will only consider a calculation at one time step of our 
scheme, and moreover use only the DF prior. A more detailed numerical analysis of our 
algorithm will be given elsewhere. We take Xq = 3.0 to be today's {t = 0) asset price. We 
consider N = 5 call options with strikes 

K = [2.0,2.5,3.0,3.5,4.0] 

Options mature at T = 1 year. Risk-free interest rate is r = 0.1. We assume that the 
"true", noiseless prices are given by the Black-Scholes model with volatility gbs = 0.25. 
This yields the following prices 

C = [1.19,0.78,0.45,0.23,0.11] 
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Implied transition distribution 

7^ 




1 .25 2.00 2.75 3.50 4.25 5.00 5.75 6.50 7.25 

implied distibution 

true distribution 
^ prior distribution 



Figure 1: Implied trandition density calculated with the DF prior, ay/h — 0.45, x — 4.0. 



We model the market noisy prices as follows: 



(69) 



where a„s measures the noise in the data, and N{0, 1) is a random number drawn from a 
normal distribution with zero mean and unit variance. In our simulation we have chosen 
(T„s = 0.15 (i.e. 15 % noise in prices). We have considered the following set of "market 
prices" obtained with such simulation: 

C = [1.07,0.68,0.53,0.22,0.10] 

Using this data as an input, we want to calculate with our method the transition density 
p{y,T\x,t) for t = 0.5. For the cumulative density p{x) at time t = 0.5, we will assume 
the lognormal distribution with the same "true" value 0-^35 = 0.25, while for the prior 
transition density q{y\x) we take the lognormal distribution with ass = 0.35. For the 
effective parameter ay/h controlling the relative weights of the entropy and likelihood 
terms, we have found that the interval aVh = [0.45, 3.0] provide a "stability plateau" 
between overfitting (for lower values of a^/h) and oversmoothing (for higher values of 
a^/h). Results for = .45 and a^/h = 3.0 arc shown on Figs.l to 2 and Figs. 3 to 

4, respectively. Figs. 5 and 6 illustrate the appearance of a fake binodal structure in the 
implied densities at a low value aVh = 0.1. For lower values of aVh, the algorithm blows 
up, in agreement with results by Buchen and Kelly [6]. This shows directly that juggling 
of an implied distribution in the canonical MCE method is due to overfitting. 
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Implied cumulative distribution 
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Figure 2: Implied cumulative density calculated with the DF prior. a\fh = 0.45. 



Implied transition distribution 
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Figure 3: Imphed tr audition density calculated with the DF prior, ay/h = 3.0, x = 4.0. 
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Implied cumulative distribution 
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Figure 4: Implied cumulative density calculated with the DF prior. a\fh = 3.0. 



Implied transition distribution 
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Figure 5: Imphed tr audition density calculated with the DF prior, ay/h = 0.1, x = 3.5. 



19 



Implied cumulative distribution 
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